number of “a”s is about 3 times the number of “b”s. Therefore, some methods may tend to favor “a”
values, which poses another risk to reliable predictions.
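To make the risk concrete, the following minimal sketch uses hypothetical labels with the roughly 3:1 ratio described above (the real target column is not reproduced here). It shows that a trivial model that always predicts “a” would already reach about 75% accuracy, so raw accuracy alone can hide a bias toward the majority class.

```python
from collections import Counter

# Hypothetical labels with roughly the 3:1 imbalance described in the text;
# these are illustrative only and are not the actual training labels.
labels = ["a"] * 750 + ["b"] * 250

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
ratio = counts["a"] / counts["b"]

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)

print(f"class counts: {dict(counts)}")
print(f"a:b ratio: {ratio:.1f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any model evaluated on such data should be compared against this baseline rather than against 50%.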
Furthermore, to better understand the data, the feature statistics were briefly analyzed
with the skim function and a correlation matrix; the results can be found below.
In addition, the continuous variables were analyzed further with the pandas profiling
module. As a result, some features were found to be redundant, and no significant abnormality or
difference was detected between the distributions of the train and test data.
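As an illustration of how redundant features can be spotted, the sketch below builds a skim-style summary and a correlation matrix in pandas. It is not the analysis code used for this report: the data are synthetic, the column names only mimic the real ones, and the 0.95 correlation cut-off is an assumed value chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real train set is not reproduced here.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(30, 5, n)
train = pd.DataFrame({
    "x1": x1,
    "x2": rng.integers(0, 2, n),
    "x3": x1 * 2.0 + rng.normal(0, 0.1, n),  # deliberately redundant with x1
})

# Skim-like summary: missing counts, mean, sd, and quantiles per feature.
stats = train.describe().T[["mean", "std", "min", "25%", "50%", "75%", "max"]]
summary = pd.concat([train.isna().sum().rename("n_missing"), stats], axis=1)

# Absolute pairwise Pearson correlations; keep only the upper triangle
# so that each pair of features is inspected once.
corr = train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off for calling two features redundant
redundant_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]

print(summary.round(3))
print(redundant_pairs)
```

In this synthetic example only the pair (x1, x3) exceeds the threshold, so one of the two could be dropped without losing much information.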
Table 1: Skim Function on Train Data
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 x1 0 1 30.1 4.70 13 27 30 33 50 ▃▇▂▁
2 x2 0 1 0.671 0.470 0 0 1 1 1 ▃▁▁▁▇
3 x3 0 1 0.662 0.473 0 0 1 1 1 ▅▁▁▁▇
4 x4 0 1 0.690 0.462 0 0 1 1 1 ▃▁▁▁▇
5 x5 0 1 9.08 5.54 0 4 9 14 18 ▇▇▆▇▇
6 x6 0 1 8.99 5.61 0 4 9 14 18 ▇▇▅▆▇
7 x7 0 1 9.11 5.50 0 4 9 14 18 ▇▇▆▇▇
8 x8 0 1 30.2 5.60 13 26 30 34 49 ▅▇▃▁
9 x9 0 1 101. 58.3 0.1 49.5 101. 153. 200 ▇▇▇▇▇
10 x10 0 1 99.7 57.7 0 49.2 99.6 150. 200. ▇▇▇▇▇
11 x11 0 1 99.7 56.9 0.1 52.4 97.5 148. 200. ▇▇▇▇▇
12 x12 0 1 0.343 0.475 0 0 0 1 1 ▇▁▁▁▅
13 x13 0 1 0.0333 0.179 0 0 0 0 1 ▇▁▁▁▁
14 x14 0 1 406. 118. 20 404 404 454 999 ▇▃▁▁
15 x15 0 1 0.850 0.358 0 1 1 1 1 ▂▁▁▁▇
16 x16 0 1 0.113 0.316 0 0 0 0 1 ▇▁▁▁▁
17 x17 0 1 0.239 0.426 0 0 0 0 1 ▇▁▁▁▂
18 x18 0 1 0.0203 0.141 0 0 0 0 1 ▇▁▁▁▁
19 x19 0 1 0.0444 0.206 0 0 0 0 1 ▇▁▁▁▁
20 x20 0 1 0.0458 0.209 0 0 0 0 1 ▇▁▁▁▁
21 x21 0 1 0.0284 0.166 0 0 0 0 1 ▇▁▁▁▁
22 x22 0 1 0.0473 0.212 0 0 0 0 1 ▇▁▁▁▁
23 x23 0 1 0.484 0.500 0 0 0 1 1 ▇▁▁▁▇
24 x24 0 1 0.104 0.305 0 0 0 0 1 ▇▁▁▁▁
25 x25 0 1 0.120 0.325 0 0 0 0 1 ▇▁▁▁▁
26 x26 0 1 0.00820 0.0902 0 0 0 0 1 ▇▁▁▁▁
27 x27 0 1 128. 70.1 14 79 120 159 570 ▇▆▁▁▁
28 x28 0 1 0.101 0.302 0 0 0 0 1 ▇▁▁▁▁
29 x29 0 1 0.0342 0.182 0 0 0 0 1 ▇▁▁▁▁
30 x30 0 1 636. 159. 62 562 624 812 999 ▇▁▃
31 x31 0 1 0.0338 0.181 0 0 0 0 1 ▇▁▁▁▁
32 x32 0 1 425. 147. 189 311 411 522 999 ▇▇▃▁▁
33 x33 0 1 0.0333 0.179 0 0 0 0 1 ▇▁▁▁▁
34 x34 0 1 0.0598 0.237 0 0 0 0 1 ▇▁▁▁▁
35 x35 0 1 0.0661 0.248 0 0 0 0 1 ▇▁▁▁▁
36 x36 0 1 20.0 93.4 0 0 0 0 845 ▇▁▁▁▁
37 x37 0 1 0.000482 0.0220 0 0 0 0 1 ▇▁▁▁▁
38 x38 0 1 0.142 0.349 0 0 0 0 1 ▇▁▁▁▁
39 x39 0 1 0.124 0.330 0 0 0 0 1 ▇▁▁▁▁
40 x40 0 1 0.124 0.330 0 0 0 0 1 ▇▁▁▁▁
41 x41 0 1 0.737 0.441 0 0 1 1 1 ▃▁▁▁▇
42 x42 0 1 10.5 73.3 0 0 0 0 999 ▇▁▁▁▁
43 x43 0 1 0.0270 0.162 0 0 0 0 1 ▇▁▁▁▁
44 x44 0 1 0.692 0.462 0 0 1 1 1 ▃▁▁▁▇
45 x45 0 1 0.0603 0.238 0 0 0 0 1 ▇▁▁▁▁
46 x46 0 1 0.00723 0.0848 0 0 0 0 1 ▇▁▁▁▁